Load Packages and Set Seed

# Load Necessary Packages
library(tidyverse)
library(tidymodels)
library(naniar)
library(skimr)
library(cowplot)

# Set Seed
set.seed(3206)

Initial Overview

The arts, from plays to musicals to live jazz, give us hope, meaning, connection and community. Particularly local arts brings us closer to the communities we engage with on a daily basis. Further, by better understanding who attends arts performances and exhibits and how often they do, as well as why other people don’t, we can, hopefully, make the arts more accessible to everyone. Thus, for our predictive modeling project, we will be using a dataset we from the Local Area Arts Participation study from 1992. This study was funded by the National Endowment for the Arts (Research Division) and conducted by Abt Associates of Cambridge, MA. It contains information on whether and how often Americans attend and participate in various kinds of local art, including jazz concerts, ballet performances, etc. It also contains various demographic characteristics of the respondents. In collecting the data, Abt Associates randomly selected Americans from 12 localities via a random-digit-dialing sampling methodology and contacted them by phone for this survey.

In total, there are 5,040 total observations and 95 attributes. 72 of the characteristics are factored data while the other 23 are numeric data. The categorical characteristics generally involve whether the respondent attended or participated in local art, as well as demographic data. The numerical features include the number of times the respondent attended that specific event as well as demographic data. The data is read in below, and named local_arts_data.

# Load the Data
load(file = "data/processed/local_arts_data.Rda")

Missingness

There appears to be significant missingness within some characteristics of the data. There are three reasons that this missingness occurs:

  • The respondent did not know the answer.
  • The question did not apply to this respondent.
  • The respondent refused to answer the question.

The majority of NA values are due to the second reason—that the question did not apply to the respondent. For instance, the columns with the most missing values, bar5 and bar4, ask about the fourth and fifth reasons for why the respondent did not attend cultural and artistic events more often. However, the previous three questions, bar1, bar2, and bar3, ask the same question, and it’s likely that most respondents didn’t have more than three reasons for not attending more cultural/artistic events. Another example is mags, which asks what magazines the respondent read to learn about art events in the community, which would be dependent on whether the respondent read magazines and cited them as a source for learning about art events. For those predictors with missing values that do not fall under that category, it is more likely that the respondent did not know the answer to the question than that they refused to answer the question or that the question didn’t apply to them.

Below is a table of the missingness in the local arts data.

# Create function to select columns with missing values
has_na <- function(x){
  any(is.na(x))
}

# Graph missingness
local_arts_data %>%
  # select only columns that have missing values
  select_if(has_na) %>%
  # summary of missing variables
  miss_var_summary() %>% 
  print(n = 20)
## # A tibble: 80 x 3
##    variable n_miss pct_miss
##    <chr>     <int>    <dbl>
##  1 bar5       5029     99.8
##  2 bar4       4946     98.1
##  3 mags       4913     97.5
##  4 more8      4823     95.7
##  5 bar3       4612     91.5
##  6 more7      4385     87.0
##  7 radio      3943     78.2
##  8 more6      3738     74.2
##  9 tvtype     3642     72.3
## 10 bar2       3590     71.2
## 11 more5      2969     58.9
## 12 mostimp    2370     47.0
## 13 more4      2229     44.2
## 14 more3      1570     31.2
## 15 bar1       1483     29.4
## 16 newsp      1432     28.4
## 17 moremost   1124     22.3
## 18 more2      1009     20.0
## 19 over18      940     18.7
## 20 income      915     18.2
## # … with 60 more rows

The truncated table above shows that there are 77 variables with missingness in our data set. In the table, variables from bar5 down to newsp have over 20% of their values as NA. For the variables with more than 20% missingness, it would probably make sense to remove most of them, as more than 20% of observations are a lot to impute. We decided to remove 14 predictors with a significant amount of missingness (bar5, bar4, mags, more8, bar3, more7, radio, more6, tvtype, bar2, more5, more4, more3, newsp). The code for that is below.

local_arts_data <- local_arts_data %>% 
  select(-bar5, -bar4, -mags, -more8, -bar3, -more7, -radio, -more6, 
         -tvtype, -bar2, -more5, -more4, -more3, -newsp)

We decided to keep one variable that had greater than 20% missingness, specifically mostimp, as that question relates to the most important reason the participant did not attend arts events more often.

For the rest of the variables with missingness at or below 20%, an imputation step will be used to replace those missing values.

Outcome Variable

In our predictive modeling project, we will attempt to predict the survey respondent’s household income, the variable income, using the other variables in the data-set related to arts attendence and demographic information as predictors. This is a categorical variable with 9 levels indicating the range of the respondent’s household income. Before conducting an initial split on the data into a training and testing set, we wanted to visualize our outcome variable.

# Distribution of income (graph)
local_arts_data %>%
  ggplot(aes(x = income)) +
  geom_bar() +
  coord_flip() +
  labs(
    title = "Distribution of Respondent's Household Income"
  ) +
  theme_minimal()

# Table of the levels of income
local_arts_data %>% 
  count(income)
## # A tibble: 10 x 2
##    income                 n
##    <fct>              <int>
##  1 Under $10,000        315
##  2 $10,000 to $14,999   288
##  3 $15,000 to $19,999   375
##  4 $20,000 to $29,999   721
##  5 $30,000 to $39,999   679
##  6 $40,000 to $49,999   540
##  7 $50,000 to $74,000   698
##  8 $75,000 to $99,999   243
##  9 $100,000 or more     266
## 10 <NA>                 915

As the graph and table above show, there is a significant amount of missingness in the outcome variable, with a total of 915 missing values. Most of this missingness is because participants didn’t know their household income or because the respondent refused to answer the question, with some NA values being truly missing responses. For the rest of the data, the categories with the highest counts are $50,000 to $74,999, $30,000 to $39,999, and $20,000 to $29,999. Categories with the lowest counts are $75,000 to $99,000 and $100,000 or more. Before splitting the data into a training and test set, it would make sense to get rid of the rows that contain missing values, since it would not make sense (nor would it be possible) to predict missing values. The code for that is below.

# Only keep observations with values for income
local_arts_data <- local_arts_data %>% 
  filter(!is.na(income))

Splitting

We decided to initially split the data into 70% training, and 30% testing. The imbalance between the various classes of income shown above prove that it would be important to stratify our initial split (and later our cross-validation) by income. Below is the code for splitting the data.

# Conduct Initial Split
local_arts_split <- initial_split(data = local_arts_data, prop = 0.7, strata = income)

# Collect Training Data
local_arts_train <- training(local_arts_split)

Essential Findings

After performing some initial cleaning and processing of the dataset, there are 74 variables remaining, 73 of which may serve as predictor variables. There are two sets of variables indicating participation in the arts: one set that is categorical, indicating whether or not the participant has gone to the indicated arts event in the last 12 months. The other set is numeric, indicating how many times the participant has gone to the indicated arts event in the last 12 months. Using both sets of variables might be a bit redundant, so we may focus on the numeric set rather than the categorical set, as that would give us a more nuanced look at not only whether or not the participant has been to certain arts events but the frequency with which they have attended those events. Their participation in the arts most likely correlates with their income bracket, as high participation in the arts may signal more available time for leisure activities.

Activity Frequency Predictor Variables

Taking a look at both the relationship between both the categorical and numeric sets of predictor variables regarding participation in the arts and our outcome variable income shows that there is definitely a trend.

# combined graph of categorical arts participation variables
local_arts_train %>% 
  select(income, jazz, classic, opera, musical, play, ballet, museum, fair) %>%
  gather(-income, key = "var", value = "value") %>%
  ggplot(aes(x = income, fill = value)) +
  geom_bar(position = "fill") +
  coord_flip() +
  facet_wrap(~ var, scales = "free") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Attendance at Arts Activities by Income",
    fill = "Have you\nattended\n(*) in the last\n12 months?",
    x = "prop"
  )

As you can see above, for every kind of local art activity, from live jazz, to a museum, to a staged play, the proportion of people who have attended in the last 12 months increases as the income brackets increases. For almost every local art activity, the income bracket with the most people who attended in the last 12 months is the $100,000 or more income bracket. This makes sense: those with more money have the financial ability and time to attend local arts events. The same pattern seems to hold when we look at the numeric representation of arts attendance and participation.

# combined graph of numeric arts participation variables
local_arts_train %>%
  select(income, starts_with("n")) %>%
  filter(!is.na(income)) %>%
  select(., 1:10) %>%
  gather(-income, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = income)) +
  geom_boxplot() +
  facet_wrap(~ var, scales = "free") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 10)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Frequency of Attendance at Arts Activities by Income",
    x = "number of times attending"
  )

It’s a little hard to see a relationship for a lot of the variables since so many people went zero times to a local arts performance. But for some variables, there does seem to be a positive relationship between the frequency of arts attendance and income. For example, people with an income of 100,000 dollars or more and those with an income between 75,000 and 99,999 visited museums in the last twelve months a median of 2 times, while those with lower household incomes visited museums on the whole less frequently. In addition, people with a household income of 20,000 dollars or more had a higher median number of visits to art fairs (nfair) than those with an income under 20,000 dollars.

Below is another graph of income in relation to more variables indicating the number of times the respondent participated in a specific activity in the last 12 months. nbooks refers to the number of books the respondent read; npark refers to the number of times the respondent visited a history park/monument; ntvart refers to the number of times the respondent watched a visual arts program on TV/VCR; ntvclass refers to the number of times the respondent watched a classical music program on TV/VCR; ntvdance refers to the number of times a respondent watched a dance program on TV; ntvjazz refers to the number of times a respondent watched a jazz program on TV; ntvmus refers to the number of times a respondent watched a musical on TV; ntvopera refers to the number of times a respondent watched opera on TV, and ntvplay refers to the number of times a respondent watched a play on TV.

# combined graph of numeric arts participation variables 2
local_arts_train %>%
  select(income, starts_with("n")) %>%
  filter(!is.na(income)) %>%
  select(., 1, 11:19) %>%
  gather(-income, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = income)) +
  geom_boxplot() +
  facet_wrap(~ var, scales = "free") +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 20)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Frequency of Engaging in/Watching Arts Activities by Income",
    x = "number of times engaging"
  )

Again, since the majority of people watched or engaged in most of the above arts activities zero times in the last twelve months, there doesn’t appear to be much of a relationship between most of the variables and income, except for nbooks and npark. There is a very clear relationship between the income of a respondent and the number of books they read in the last twelve months, with those with the highest income reading the most and those with the lowest income reading the least. There also seems to be a slight relationship between the income of a respondent and the number of times they visited a historic park or monument, with those with the highest income visiting historic sites slightly more frequently then those with the lowest income.

Another predictor variable related to arts participation/engagement that we think would be interesting to investigate in relation to income is whether or not a person visited the movies in the past twelve months.

local_arts_data %>% 
  ggplot(aes(income, fill = cinema)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Going to the movies by income",
       fill = "Have you gone to movie theater\nto see a movie\nin last 12 months?",
       y = "prop")

The graph pretty clearly shows that more people with higher incomes went to see a movie in the theaters in the last twelve months than people with lower incomes.

Activity Reasoning Predictor Variables

While the above variables provide broad trends as to the frequency of the participants’ attendance to their local arts events, it may be just as important as to understand why they have and why they haven’t.

For instance, we can take a look at the variable bar1, which indicates respondent’s first reason for not attending local arts events/not attending them more often.

# distribution of bar1
local_arts_train %>% 
  mutate(
    bar1 = bar1 %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(bar1)) +
  geom_bar() +
  coord_flip() +
  labs(
    x = "reason for not attending"
  ) +
  theme_minimal()

As you can see above, the most common reason for people not attending local arts events more frequently is that they don’t have time, followed by the cost of tickets and lack of childcare. It could also be interesting to see how these reasons for not attending local arts events change (or remain the same) across levels of income.

# distribution of bar1 by income
local_arts_train %>% 
  mutate(
    bar1 = bar1 %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(bar1)) +
  geom_bar() +
  facet_wrap(~ income) +
  coord_flip() +
  labs(
    x = "reason for not attending"
  ) +
  theme_minimal()

It seems that across all income levels, lack of time is the primary reason for not attending local arts events more frequently.

Other variables that ask about the reason behind their lack of attendance include:

  • mostimp: most important reason participant did not attend arts events more often

  • howimp: how important it is for the participant to attend events

Demographic Predictor Variables

Another important factor to bear in mind when observing and predicting survey respondent’s household income is their demographic information. These variables include:

  • education level

  • gender

  • age

  • household size

  • marital status

Each of these seem as if they could provide valuable insight into the income of the participant. For example, income generally increases with education and age, while a larger household size may indicate higher income with more financial stability.

Below is a plot of the highest education level by income.

local_arts_train %>%
  select(income, educ, race, gender, age, hhsize) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(income, fill = educ)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Education Levels across Income Brackets",
    fill = "Education Level",
    y = "prop"
  )

As hypothesized, there appears to be a correlation between education level and income. The proportion of people with bachelor’s degrees and with grad degrees increases as income increases.

Since there are multiple levels of educ, it would make sense to visualize the distribution of this predictor.

# distributin of educ
local_arts_train %>% 
  mutate(
    educ = educ %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(educ)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

The levels with the highest counts are Some college, High school, and Bachelors degree. There are a few levels, Some grad. school, Vocational school, Grades k-8, No school, and Other, that have very few values. It might make sense to combine these into a single other level using step_other in our recipe, or employing the step_nzv or step_zv functions to deal with zero-variance predictors.

The next demographic factor is gender. (Race was included as a secondary finding, rather than a primary finding.)

local_arts_train %>%
  select(income, educ, gender, age, hhsize) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(income, fill = gender)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Gender across Income Brackets",
    fill = "Gender",
    y = "prop"
  )

Across nearly all the income brackets, the proportion of females outnumber the proportion of males, except in the highest income bracket.

local_arts_train %>%
  select(income, educ, gender, age, hhsize) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(age)) +
  geom_bar() +
  coord_flip() +
  facet_wrap(~ income) +
  theme_minimal() +
  labs(
    title = "Distribution of Age across Income Brackets",
    x = "Age",
    y = "Count"
  )

The distribution of ages is unimodal in all income brackets, with many of the graphs having a left tail. As the income bracket increases, the plot appears to move “upwards.” In other words, the age increases as the income increases, indicating a positive relationship between the two. This would make sense: younger adults tend to have less money due to their lack of experience. Older adults have more experience in the workforce and may be more financially responsible, leading to a higher income.

local_arts_train %>%
  select(income, educ, gender, age, hhsize) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(hhsize)) +
  geom_bar() +
  coord_flip() +
  facet_wrap(~ income, scales = "free") +
  theme_minimal() +
  labs(
    title = "Distribution of Household Size across Income Brackets",
    x = "Household Size",
    y = "Count"
  )

The distribution of household sizes is heavily left skewed in all income brackets. As the income bracket increases, the household size increases as well. One noteworthy observation is that one point, specifically located in the income bracket of $30,000 to $39,999, appears to be an outlier, with a value of 73 for the household size. In future analysis, the household size for that point may be imputed, or the observation may be removed altogether as this is likely a data collection error.

local_arts_train %>%
  select(income, marital) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(income, fill = marital)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Marital Status across Income Brackets",
    y = "prop",
    fill = "Marital Status"
  )

Clearly, as the the income increases, so do the proportion of married couples. This makes sense: married participants have their primary source of income as well as their partner’s source of income, leading to a higher household income.

The last demographic variable we wanted to explore with income is race.

local_arts_train %>%
  select(income, race) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(income, fill = race)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Race by Income",
    fill = "Count",
    y = "prop"
  )

The graph above shows that in all income brackets, the majority of respondents were White, not Hispanic. However, you can tell that there is a slight increase in the proportion of white people as the income bracket increases, and a subsequent decrease in people of color, particularly black folks, as the income bracket increases. This, sadly, is an unsurprising trend given the perpetuation of systemic racism in America.

It would also make sense to look at the distribution of race by itself given the overrepresentation of White, not Hispanic folks in the above graph.

# distribution of race
local_arts_train %>% 
  mutate(
    race = race %>% fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(race)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

White folks vastly outnumber folks of other races in this dataset. There is very little diversity in this variable, which will be a problem when dummy encoding. It would make sense to either combine levels of this variable using a step_other category in our recipe, or employing step_nzv or step_zv to remove near-zero or zero-variance levels.

Since we’ve looked at race and income, it would also be interesting to visualize the relationship between race and arts attendance, focusing specifically on live jazz, classical music, and museums.

# combined graph of categorical arts participation variables
local_arts_train %>% 
  select(race, jazz, classic, museum) %>%
  gather(-race, key = "var", value = "value") %>%
  ggplot(aes(x = race, fill = value)) +
  geom_bar(position = "fill") +
  coord_flip() +
  facet_wrap(~ var, scales = "free", nrow = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Attendance at Arts Activities by Race",
    fill = "Have you\nattended\n(*) in the last\n12 months?",
    x = "prop"
  )

The graph shows that a slightly higher proportion of white folks attended live classical music concerts in the last 12 months than people of other racial identities. A slightly higher proportion of black folks, on the other hand, attended live jazz concerts in the last 12 months. All Alaskan Native folks attended museums in the last 12 months, but this is rather misleading given there are so few Alaskan Native folks in the data set (as seen in the distribution of race graph above). After that, it appears that South American and white folks attended museums at a slightly higher rate than other people.

Secondary Findings

The secondary findings mainly consist of variables that were explored in the process of this exploratory data analysis, but fail to present interesting or surprising results.

Variables without Variance

Using the caret package, we used nearZeroVar() to identify which variables had little to no variance. Such values would fail to provide meaningful insight or significantly affect the model.

local_arts_train %>%
  select(caret::nearZeroVar(local_arts_train)) %>%
  colnames()
## [1] "abtid"    "nopera"   "source4"  "source5"  "exchange"

Thus, these variables should be left out of the dataset and recipe.

Preferences and Subjective Questions

Many questions revolved around musical/artistic preferences of the participant. One such example is lisjazz, which explores whether the participant listened to jazz performances on radio, tapes, or compact discs in the last 12 months. There are 4 other variables like this, but instead recording whether or not a respondent listened to classical, opera, general music or a play on the radio, tapes or compact discs in the last 12 months. A graph of the counts for these variables is below.

p1 <- local_arts_train %>%
  select(starts_with("lis")) %>%
  ggplot(aes(lisjazz)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "Listened to Jazz?")

p2 <- local_arts_train %>%
  select(starts_with("lis")) %>%
  ggplot(aes(lisclass)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "Listened to Classical?")

p3 <- local_arts_train %>%
  select(starts_with("lis")) %>%
  ggplot(aes(lisopera)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "Listened to Opera?")

p4 <- local_arts_train %>%
  select(starts_with("lis")) %>%
  ggplot(aes(lismus)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "Listened to Music?")

p5 <- local_arts_train %>%
  select(starts_with("lis")) %>%
  ggplot(aes(lisplay)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "Listened to Play?")

plot_grid(p1, p2, p3, p4, p5)

Since the mediums are widely accessible (radio, tapes, or compact discs), whether the participant listened to a type of music seems like it is more about the participant’s preference, rather than any meaningful information about their income level. This is also true of the questions that ask about what kind of events the participant would like to attend more of. The first graph below represents the first type of art or performance a respondent would like to attend more of, while the second graph below representst the second type of art or performance a respondent would like to attend more of.

p1 <- local_arts_train %>%
  select(starts_with("more")) %>%
  ggplot(aes(more1)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "I want to attend more...?")

p2 <- local_arts_train %>%
  select(starts_with("more")) %>%
  ggplot(aes(more2)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(x = "Count",
       y = "I want to attend more...?")

plot_grid(p1, p2)

Demographic Variables

One demographic characteristic considered as a potential factor was race. Below is the plot relating race with income.

local_arts_train %>%
  select(income, race) %>%
  filter(!is.na(income)) %>%
  ggplot(aes(race)) +
  geom_bar() +
  coord_flip() +
  facet_wrap(~ income) +
  theme_minimal() +
  labs(
    title = "Race by Income",
    x = "Count",
    y = "Race"
  )

There appears to be little diversity. Instead, the predominate count is White, not Hispanic, which far outnumbers the other races.